{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook walks through a basic example of using the GPU-accelerated estimators from [RAPIDS](https://rapids.ai/) cuML and [DMLC/XGBoost](https://github.com/dmlc/xgboost) with TPOT for classification tasks. You must have access to an NVIDIA GPU and have cuML installed in your environment. Running this notebook without cuML will cause TPOT to raise a `ValueError`, indicating you should install cuML.\n",
    "\n",
    "It is intended to show how the `TPOT cuML` configuration can provide significant performance benefits on medium-sized and larger datasets. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading Data\n",
    "\n",
    "This example uses the Higgs Boson [dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) from the UC Irvine Machine Learning Repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "from tpot import TPOTClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a 2.7 GB file.\n",
    "# Please make sure you have enough space available before\n",
    "# uncommenting the code below and downloading this file.\n",
    "\n",
    "DATA_DIRECTORY = \"./\"\n",
    "DATASET_PATH = os.path.join(DATA_DIRECTORY, \"HIGGS.csv.gz\")\n",
    "\n",
    "# if not os.path.isfile(DATASET_PATH):\n",
    "#     !wget https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz -P {DATA_DIRECTORY}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This fuction is borrowed and adapted from\n",
    "# https://github.com/NVIDIA/gbm-bench/blob/master/datasets.py\n",
    "# Thanks!\n",
    "\n",
    "def prepare_higgs(nrows=None):\n",
    "    higgs = pd.read_csv(DATASET_PATH, nrows=nrows)\n",
    "    X = higgs.iloc[:, 1:].to_numpy(dtype=np.float32)\n",
    "    y = higgs.iloc[:, 0].to_numpy(dtype=np.int64)\n",
    "    return train_test_split(X, y, stratify=y, random_state=77, test_size=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running TPOTClassifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the interest of time, we'll only use a 500,000 row sample of this file. 500,000 rows is more than enough for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "NROWS = 500_000\n",
    "X_train, X_test, y_train, y_test = prepare_higgs(nrows=NROWS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that for cuML to work correctly, you must set `n_jobs=1` (the default setting)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Optimization Progress', max=110.0, style=ProgressStyle(de…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Generation 1 - Current best internal CV score: 0.730335\n",
      "Generation 2 - Current best internal CV score: 0.730335\n",
      "Generation 3 - Current best internal CV score: 0.730335\n",
      "Generation 4 - Current best internal CV score: 0.735615\n",
      "Generation 5 - Current best internal CV score: 0.7359375\n",
      "Generation 6 - Current best internal CV score: 0.7359375\n",
      "Generation 7 - Current best internal CV score: 0.7359375\n",
      "Generation 8 - Current best internal CV score: 0.7359375\n",
      "Generation 9 - Current best internal CV score: 0.736115\n",
      "Generation 10 - Current best internal CV score: 0.7361850000000001\n",
      "Best pipeline: XGBClassifier(ZeroCount(SelectPercentile(ZeroCount(input_matrix), percentile=99)), alpha=1, learning_rate=0.1, max_depth=9, min_child_weight=11, n_estimators=100, nthread=1, subsample=0.7000000000000001, tree_method=gpu_hist)\n",
      "CPU times: user 8min 15s, sys: 1min 17s, total: 9min 33s\n",
      "Wall time: 9min 39s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "TPOTClassifier(config_dict='TPOT cuML', cv=2, generations=10,\n",
       "               log_file=<ipykernel.iostream.OutStream object at 0x7f698a7b5990>,\n",
       "               population_size=10, random_state=12, verbosity=2)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# cuML TPOT setup\n",
    "SEED = 12\n",
    "GENERATIONS = 10\n",
    "POP_SIZE = 10\n",
    "CV = 2\n",
    "\n",
    "tpot = TPOTClassifier(\n",
    "    generations=GENERATIONS,\n",
    "    population_size=POP_SIZE,\n",
    "    random_state=SEED,\n",
    "    config_dict=\"TPOT cuML\",\n",
    "    n_jobs=1, # cuML requires n_jobs=1, the default\n",
    "    cv=CV,\n",
    "    verbosity=2,\n",
    ")\n",
    "\n",
    "tpot.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.73853\n",
      "CPU times: user 950 ms, sys: 36.2 ms, total: 986 ms\n",
      "Wall time: 984 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "preds = tpot.predict(X_test)\n",
    "print(accuracy_score(y_test, preds))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, description='Optimization Progress', max=110.0, style=ProgressStyle(de…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Generation 1 - Current best internal CV score: 0.7184675\n",
      "Generation 2 - Current best internal CV score: 0.7184675\n",
      "Generation 3 - Current best internal CV score: 0.7198\n",
      "Generation 4 - Current best internal CV score: 0.7210825000000001\n",
      "Generation 5 - Current best internal CV score: 0.7222999999999999\n",
      "Generation 6 - Current best internal CV score: 0.7222999999999999\n",
      "Generation 7 - Current best internal CV score: 0.7270125000000001\n",
      "Generation 8 - Current best internal CV score: 0.73546\n",
      "Generation 9 - Current best internal CV score: 0.73546\n",
      "Generation 10 - Current best internal CV score: 0.735545\n",
      "Best pipeline: XGBClassifier(OneHotEncoder(input_matrix, minimum_fraction=0.2, sparse=False, threshold=10), learning_rate=0.1, max_depth=9, min_child_weight=19, n_estimators=100, nthread=1, subsample=1.0)\n",
      "CPU times: user 10min, sys: 1min 8s, total: 11min 9s\n",
      "Wall time: 5h 17min 28s\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "TPOTClassifier(cv=2, generations=10,\n",
       "               log_file=<ipykernel.iostream.OutStream object at 0x7f282044a7d0>,\n",
       "               n_jobs=-1, population_size=10, random_state=12, verbosity=2)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# Default TPOT setup with same params\n",
    "tpot = TPOTClassifier(\n",
    "    generations=GENERATIONS,\n",
    "    population_size=POP_SIZE,\n",
    "    random_state=SEED,\n",
    "    n_jobs=-1,\n",
    "    cv=CV,\n",
    "    verbosity=2,\n",
    ")\n",
    "\n",
    "tpot.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.7378900051116943\n",
      "CPU times: user 968 ms, sys: 0 ns, total: 968 ms\n",
      "Wall time: 967 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "preds = tpot.predict(X_test)\n",
    "print(accuracy_score(y_test, preds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Performance Comparison\n",
    "With the example configuration above (10 generations, population size of 10, two-fold cross validation), the `TPOT cuML` configuration provided a significant speedup while achieving essentially equivalent accuracy.\n",
    "\n",
    "The GPU-accelerated version achieved an out-of-sample accuracy of 73.85% in **fewer than 10 minutes**, while the default version achieved an accuracy of 73.79% after more than **five hours** (specific performance values will vary across runs). This kind of speedup also means you can create larger evolutionary search strategies while **still** obtaining faster results.\n",
    "\n",
    "### Hardware\n",
    "The following hardware was used for this test. Results and speedups will vary across systems and configurations.\n",
    "\n",
    "- CPU: 2x Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz (24 cores)\n",
    "- GPU: 1x NVIDIA V100 32GB"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
